STAR (Dobin et al., 2013) is a powerful RNA-seq aligner tool that will map raw RNA-seq reads to a reference genome. STAR can accept single-end or paired-end read (sequence) files as input. Ideally, this will be in FASTQ file format, but FASTA file format can also be accepted by STAR.
|
Single-end FASTQ Examples
|
Paired-end FASTQ Examples
|
||
|---|---|---|---|
| sampleA.fastq.gz | sampleA.fq.gz | sampleA_r1.fastq.gz | sampleA_r1.fq.gz |
| sampleB.fastq.gz | sampleB.fq.gz | sampleA_r2.fastq.gz | sampleA_r2.fq.gz |
| sampleC.fastq.gz | sampleC.fq.gz | sampleB_r1.fastq.gz | sampleB_r1.fq.gz |
| sampleB_r2.fastq.gz | sampleB_r2.fq.gz | ||
| sampleC_r1.fastq.gz | sampleC_r1.fq.gz | ||
| sampleC_r2.fastq.gz | sampleC_r2.fq.gz | ||
Publicly available RNA-seq data of this format can be found using the NCBI sequence read archive (SRA); the NCBI Gene Expression Omnibus (GEO); and other databases. Data coming from the NCBI SRA are conventionally written with the prefix “SRR”. Download this data to any desired folder on SickKids’ HPC cluster. To do this, use wget or curl or another non-interactive download tool supporting HTTP, HTTPS, and FTP protocols.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR750/005/SRR7506545/SRR7506545_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR750/005/SRR7506545/SRR7506545_1.fastq.gz -o SRR7506545_day00_RNA_sequencing_Rep1_hESC_1.fastq.gz
To conserve disk space, leave the raw sequencing data compressed (.gz). STAR can use compressed files.
Perform a quality check of the downloaded reads using FastQC (Andrews, 2010) and determine whether any pre-processing needs to occur by checking the output HTML report for each read.
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=100g,vmem=100g
#PBS -m e
#######################################
# DO NOT EDIT
module load fastqc/0.11.9
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
INPUT=/path/to/raw/seq/reads/folder
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/qc/output/folder
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
fastqc -o $OUTPUT $INPUT/$FILE.$READFORMAT.gz
done
#######################################
Execute this shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=50g,vmem=50g
#PBS -m e
#######################################
# DO NOT EDIT
module load fastqc/0.11.9
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
INPUT=/path/to/raw/seq/reads/folder
END1= # paired-end 1 file suffix; e.g., _1, _r1, _R1, .1, .r1, .R1, etc.
END2= # paired-end 2 file suffix; e.g., _2, _r2, _R2, .2, .r2, .R2, etc.
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/qc/output/folder
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
fastqc -o $OUTPUT $INPUT/$FILE$END1.$READFORMAT.gz $INPUT/$FILE$END2.$READFORMAT.gz
done
#######################################
Execute this shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
STAR (Dobin et al., 2013) is a powerful RNA-seq aligner tool that will map RNA-seq reads to a reference genome. STAR can accept FASTA sequence files as genome input and GTF (or GFF3) annotation files as genome annotation input.
| FASTA Genome Examples | GTF Annotation Examples |
|---|---|
| GRCh38.p13.genome.fa | gencode.v34.annotation.gtf |
| lncipedia_5_2_hg38.gtf | |
| GCF_015227675.2_mRatBN7.2_genomic.fna | GCF_015227675.2_mRatBN7.2_genomic.gtf |
It is imperative that the FASTA chromosome names match the GTF chromosome names. To avoid issues regarding chromosome naming differences, try to use files from the same source together (e.g., ENSEMBL FASTA files with ENSEMBL GTF files, etc.).
Full reference genome assemblies are best downloaded from the International Genome Reference Consortium (GRC) through RefSeq (form: GCF_000001405) or NCBI’s GenBank (form: GCA_000001405). Both versions are available through GENCODE for many organism assemblies.
Genesets for Homo sapiens can be easily downloaded from a variety of reputable organizations and institutions, including:
Although these files often come compressed (.gz), they must be decompressed for use by tools in this pipeline. To do this, use gunzip.
gunzip GRCh38.p13.genome.fa.gz
gunzip gencode.v38.annotation.gtf.gz
This step need only be performed ONCE per genome assembly; it can be skipped if a STAR genome index for a particular version of a genome assembly of interest has previously been generated. By indexing the genome, STAR can access reference genome information much more efficiently, thus running quicker.
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=50g,vmem=50g
#PBS -m e
#######################################
# DO NOT EDIT
module load star/2.7.0f
module load samtools/1.9
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
GENOMEINDEX=/path/to/star/genome/indices/folder
GENOMESEQ=/path/to/genome/fasta/file.fa
#######################################
#######################################
# DO NOT EDIT
STAR --runThreadN 32 --runMode genomeGenerate --genomeDir $GENOMEINDEX --genomeFastaFiles $GENOMESEQ
samtools faidx $GENOMESEQ
#######################################
Note: After the job running the script above has completed, a
Log.outfile will printed to your root directory./detailing information about your Star run. This file is mainly for debugging, so it can be deleted if your job was successful. If you want this file to print somewhere else, specify this path as a parameter for the argument--outFileNamePrefix. E.g.,--outFileNamePrefix /path/to/output/dir.
First, write an R file (.r) that will be used to calculate the normalized expression values for each gene – as annotated by whichever genome annotation you supply – in reads per kilobase of transcript, per million mapped reads (RPKM). In this way, we effectively normalize our samples for sequencing depth and gene length.
Note the path where this R file is saved, as it will need to be referenced in the following shell script. The contents of this file are shown below. Edit the paths and file names where indicated.
#######################################
# EDIT VARIABLES TO MATCH THOSE BY THE SAME NAME IN THE SHELL SCRIPT BUT IN QUOTES HERE
OUTPUT <- "/path/to/desired/output/folder"
ANNOTYPE <- "" # e.g., GENCODE, LNCipedia, etc.
ATTRIBUTE <- "" # check column 9 of annotation file for desired gene identifier; e.g., gene_id, gene_name, gene, transcript_id, product, etc.
PREFIX <- "" # the shared prefix of all raw data being processed in this run; e.g., SRR, PM_C28_WT, etc.
#######################################
#######################################
# DO NOT EDIT
setwd(OUTPUT)
dirs <- dir(OUTPUT, pattern=c(PREFIX,"[[:print:]]+$"))
fileName <- file.path(dirs, paste0(dirs, paste0(".",ANNOTYPE,".",ATTRIBUTE,".base.counts.txt")))
newFileName <- file.path(dirs, paste0(dirs, paste0(".",ANNOTYPE,".table_rpkm.txt")))
for (i in seq(along=fileName)) {
d <- read.table(fileName[i],sep="\t",header=T)
row.names(d) <- d$Geneid
# STORE GENE LENGTHS COLUMN OF COUNTS TABLE
l <- d[,6]
# SUM COUNTS COLUMNS TO GET TOTAL MAPPED READS PER SAMPLE
cS <- colSums(d[, 7:ncol(d), drop=F])
# STORE A MATRIX OF THE READ COUNTS PER GENE
d <- d[,-c(1:6)]
# CALCULATE RPKMS FOR EACH GENE (ASSUMING GENE LENGTHS ARE NOT KB)
rpkm <- (10^9)*t(t(d/l)/cS)
# RESTORE COLUMN NAME
colnames(rpkm)[1] <- names(cS)
write.table(rpkm,file=newFileName[i],sep="\t",quote=F,row.names=F)
}
#######################################
Next, write a shell file (.sh) with the contents shown below. Edit the paths and file names where indicated. Generally, FEATURE=exon and ATTRIBUTE=gene_id are used; only change the inputs for these variables if you know what you are doing.
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=100g,vmem=100g
#PBS -m e
#######################################
# DO NOT EDIT
module load star/2.7.0f
module load samtools/1.9
module load rna-seqc/2.0.0
module load subread/2.0.0
module load R/3.6.1
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
CELLTYPE= # e.g., RPE1_Control, THP1_PMAtreated, etc.
INPUT=/path/to/raw/seq/reads/folder
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/output/folder
GENOMEINDEX=/path/to/star/genome/indices/folder
ANNOTATION=/path/to/genome/annotation/file.gtf
ANNOTYPE= # e.g., GENCODE, LNCipedia, etc.
FEATURE= # check column 3 of annotation file for desired feature to count reads to; e.g., exon, transcript, gene, CDS, UTR, etc.
ATTRIBUTE= # check column 9 of annotation file for desired gene identifier; e.g., gene_id, gene_name, gene, transcript_id, product, etc.
RPKMscript=/path/to/Rscript/file.r
SAMPLE=sampleA # must match the first sample in your dataset
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
mkdir $OUTPUT
mkdir $OUTPUT/$FILE
# MAP READS WITH STAR
STAR --runThreadN 32 \
--genomeDir $GENOMEINDEX \
--sjdbGTFfile $ANNOTATION \
--readFilesIn $INPUT/$FILE.$READFORMAT.gz \
--outFileNamePrefix $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_ \
--quantMode GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand gunzip -c
# INDEX MAPPED READ FILES WITH SAMTOOLS
samtools index $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# QUALITY CONTROL OF MAPPED READ FILES WITH RNASEQC
rna-seqc $ANNOTATION $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam $OUTPUT/$FILE/$FILE.$ANNOTYPE.rnaseqc_report --coverage --rpkm --sample $FILE.$ANNOTYPE.rnaseqc
# DETERMINE THE NUMBER OF READS MAPPED TO EACH GENE OF THE GENOME WITH FEATURECOUNTS
featureCounts -T 32 --ignoreDup -t $FEATURE -g $ATTRIBUTE -a $ANNOTATION -o $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# REMOVE HEADER LINE FROM COUNTS FILE
echo "$(tail -n +2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt)" > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# NAME THE COUNTS COLUMN
sed -i "1s/\/hpf.*bam/$FILE.Counts/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# ISOLATE THE COUNTS COLUMN
cut -f7 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.countsonly.txt
# COPY GENEID DATA TO A SEPARATE FILE
awk '{print $1}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$ANNOTYPE.gene_ids.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENEID DATA
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.consensus.txt
# CALCULATE THE NORMALIZED EXPRESSION VALUES FOR EACH GENE
Rscript $RPKMscript
# NAME THE RPKM COLUMN
sed -i "1s/X.*bam/$FILE.RPKM/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt
# ISOLATE THE RPKM COLUMN
cut -f2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.rpkmonly.txt
# COPY GENE POSITION INFORMATION TO A SEPARATE FILE
awk 'BEGIN {OFS="\t"}; {print $2,$3,$4,$5}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# REMOVE HEADER LINE FROM GENOMIC POSITION FILE
echo "$(tail -n +2 $OUTPUT/featureCoordTemp.$ANNOTYPE.txt)" > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# PARSE AND SIMPLIFY GENOMIC POSITION INFORMATION
awk 'BEGIN {OFS="\t"}; {
split($1,array,";")
num1=split($2,array1,";")
num2=split($3,array2,";")
min=array1[1]
for(i=2;i<=num1;i++){
min=(min<array1[i]?min:array1[i])
}
max=array2[1]
for(i=2;i<=num2;i++){
max=(max>array2[i]?max:array2[i])
}
split($4,array3,";")
print array[1],min,max,array3[1]
}' $OUTPUT/featureCoordTemp.$ANNOTYPE.txt > $OUTPUT/featureCoord.$ANNOTYPE.txt
# REMOVE TEMPORARY GENOMIC POSITION FILE
rm $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# ADD ALL RPKM COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.rpkmonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.RPKM.genPos.consensus.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.genPos.consensus.txt
done
#######################################
Finally, execute the newly written shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=100g,vmem=100g
#PBS -m e
#######################################
# DO NOT EDIT
module load star/2.7.0f
module load samtools/1.9
module load rna-seqc/2.0.0
module load subread/2.0.0
module load R/3.6.1
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
CELLTYPE= # e.g., RPE1_Control, THP1_PMAtreated, etc.
INPUT=/path/to/raw/seq/reads/folder
END1= # paired-end 1 file suffix; e.g., _1, _r1, _R1, .1, .r1, .R1, etc.
END2= # paired-end 2 file suffix; e.g., _2, _r2, _R2, .2, .r2, .R2, etc.
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/output/folder
GENOMEINDEX=/path/to/star/genome/indices/folder
ANNOTATION=/path/to/genome/annotation/file.gtf
ANNOTYPE= # e.g., GENCODE, LNCipedia, etc.
FEATURE= # check column 3 of annotation file for desired feature to count reads to; e.g., exon, transcript, gene, CDS, UTR, etc.
ATTRIBUTE= # check column 9 of annotation file for desired gene identifier; e.g., gene_id, gene_name, gene, transcript_id, product, exon_id, etc.
RPKMscript=/path/to/Rscript/file.r
SAMPLE=sampleA # must match the first sample in your dataset
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
mkdir $OUTPUT
mkdir $OUTPUT/$FILE
# MAP READS WITH STAR
STAR --runThreadN 32 \
--genomeDir $GENOMEINDEX \
--sjdbGTFfile $ANNOTATION \
--readFilesIn $INPUT/$FILE$END1.$READFORMAT.gz $INPUT/$FILE$END2.$READFORMAT.gz \
--outFileNamePrefix $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_ \
--quantMode GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand gunzip -c
# INDEX MAPPED READ FILES WITH SAMTOOLS
samtools index $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# QUALITY CONTROL OF MAPPED READ FILES WITH RNASEQC
rna-seqc $ANNOTATION $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam $OUTPUT/$FILE/$FILE.$ANNOTYPE.rnaseqc_report --coverage --rpkm --sample $FILE.$ANNOTYPE.rnaseqc
# DETERMINE THE NUMBER OF READS MAPPED TO EACH GENE OF THE GENOME WITH FEATURECOUNTS
featureCounts -T 32 --ignoreDup -p -t $FEATURE -g $ATTRIBUTE -a $ANNOTATION -o $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# REMOVE HEADER LINE FROM COUNTS FILE
echo "$(tail -n +2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt)" > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# NAME THE COUNTS COLUMN
sed -i "1s/\/hpf.*bam/$FILE.Counts/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# ISOLATE THE COUNTS COLUMN
cut -f7 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.countsonly.txt
# COPY GENEID DATA TO A SEPARATE FILE
awk '{print $1}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$ANNOTYPE.gene_ids.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENEID DATA
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.consensus.txt
# CALCULATE THE NORMALIZED EXPRESSION VALUES FOR EACH GENE
Rscript $RPKMscript
# NAME THE RPKM COLUMN
sed -i "1s/X.*bam/$FILE.RPKM/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt
# ISOLATE THE RPKM COLUMN
cut -f2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.rpkmonly.txt
# COPY GENE POSITION INFORMATION TO A SEPARATE FILE
awk 'BEGIN {OFS="\t"}; {print $2,$3,$4,$5}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# REMOVE HEADER LINE FROM GENOMIC POSITION FILE
echo "$(tail -n +2 $OUTPUT/featureCoordTemp.$ANNOTYPE.txt)" > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# PARSE AND SIMPLIFY GENOMIC POSITION INFORMATION
awk 'BEGIN {OFS="\t"}; {
split($1,array,";")
num1=split($2,array1,";")
num2=split($3,array2,";")
min=array1[1]
for(i=2;i<=num1;i++){
min=(min<array1[i]?min:array1[i])
}
max=array2[1]
for(i=2;i<=num2;i++){
max=(max>array2[i]?max:array2[i])
}
split($4,array3,";")
print array[1],min,max,array3[1]
}' $OUTPUT/featureCoordTemp.$ANNOTYPE.txt > $OUTPUT/featureCoord.$ANNOTYPE.txt
# REMOVE TEMPORARY GENOMIC POSITION FILE
rm $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# ADD ALL RPKM COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.rpkmonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.RPKM.genPos.consensus.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.genPos.consensus.txt
done
#######################################
Finally, execute the newly written shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
At the top of most scripts in this protocol, where you are supposed to curate the variables to suit your data, make sure you are adhering to the these strict rules:
For shell script variables, type immediately following the “=” sign. Do NOT add a space between the variable name and your text, and do NOT put your text in single or double quotes. This applies to strings and file/folder paths.
For R script variables, type between the double quotes that follow the “<-” sign. Do NOT add a space between the opening or closing quotation mark and your text. This applies to strings and file/folder paths. Double-check whether you have accidentally included extra single or double quotes within the required set already in the template.
In the R script, you must specify the prefix shared by all samples you wish to analyze in this job. The more specific (/longer) the shared prefix, the better. This is particularly important if you are analyzing a new subset of samples and storing their results in an output folder you have previously stored results in. You do not want RPKMs to be (re-)calculated for all samples in that folder, rather, only those files part of this latest job which have a shared prefix that is distinct from previously analyzed samples in the same destination folder.
After mapping and counting the reads, it is wise to build a table of quality control (QC) metrics. At this point in the RNA-seq pipeline, there have been three main QC checkpoints: (i) the pre-mapping FastQC output, (ii) the STAR mapping output, and (iii) the post-mapping RNA-SeQC ouptut.
From the FastQC output, you should observe the .html results in a web browser to properly check the output figures. Beyond that, though, it is good practice to note the Total Sequences, the Sequence length, and %GC stats. To view these metrics from the command line, use the following command:
# to print the full contents of all FASTQC files to standard input, then comb through each for desired stats
lynx -dump /path/to/fastqcDir/*_fastqc.html
# or
# to print only the desired stat lines for one FASTQC file at a time
lynx -dump <sampleName>_fastqc.html | grep -Hn 'Total Sequences\|Sequence length \|%GC'
Record the total number of sequences and read length for each sample in a spreadsheet.
From the STAR output, it is important to note the Number of input reads (line 6), the Average input read length (line 7), the Uniquely mapped reads number (line 9), and the Uniquely mapped reads % (line 10) metrics. To extract key mapping stats from the STAR log files, use the following command:
grep -Hin 'Number of input reads\|Average input read length\|Uniquely mapped reads number\|Uniquely mapped reads %' /path/to/STARoutputDir/*/*_Log.final.out
Note: If you have paired-end reads, the average input read length will be exactly 2X as large as expected.
Ensure the average input read length aligns with the value presented by FastQC. Record the number of input reads, the number of mapped reads, and the % mapped reads.
From the RNA-SeQC output, it is important to note the Mapped Reads (line 42), the Total Reads (line 51), and the Read Length (line 54) metrics. To extract key QC metrics from the RNA-SeQC metrics files, use the following command:
grep -Hin '^Mapped Reads\|Total Reads\|Read Length' /path/to/STARoutputDir/*/*.rnaseqc_report/*.rnaseqc.metrics.tsv
Ensure the average input read length aligns with the value presented by FastQC. Record the number of input reads, the number of mapped reads. Unfortunately, RNA-SeQC does not provide the % mapped reads, so you will have to calculate this from the number of input reads and mapped reads; use a Microsoft Excel function to perform the calculation, or adapt some code for the built-in basic calculator (bc) installed on most Linux machines to evaluate the simple expression echo "scale=4; <numMappedReads> / <numInputReads>" | bc -l.
The final table should look similar to the following example:
| Sample | Cell type | Condition | Replicate | Avg read length | Number of input reads | Number of mapped reads | % Mapped reads | Number of input reads | Number of mapped reads | % Mapped reads |
|---|---|---|---|---|---|---|---|---|---|---|
| SRR0001234_KO1_S01_L001 | RPMI | KO | 1 | 151 | 67484983 | 59799317 | 88.61 | 145114308 | 126153146 | 86.93 |
| SRR0001235_KO1_S01_L002 | RPMI | KO | 1 | 151 | 54037361 | 47990864 | 88.81 | 109585370 | 100332682 | 91.56 |
| SRR0001236_KO2_S02_L001 | RPMI | KO | 2 | 151 | 44785818 | 39059650 | 87.21 | 89343410 | 81736108 | 91.49 |
| SRR0001237_WT1_S03_L001 | RPMI | WT | 1 | 151 | 50258972 | 44434815 | 88.41 | 101977068 | 93020954 | 91.22 |
| SRR0001238_WT2_S04_L001 | RPMI | WT | 2 | 151 | 73801621 | 65029508 | 88.11 | 156557084 | 137278094 | 87.69 |
| SRR0001239_WT2_S04_L002 | RPMI | WT | 2 | 151 | 51791198 | 46018544 | 88.85 | 109433578 | 96906298 | 88.55 |
| SRR0009876_UP1_S05_L001 | RPMI | UP | 1 | 151 | 48162197 | 42939308 | 89.16 | 99361950 | 89973546 | 90.55 |
To assess the quality of the raw data and mapping performance, metrics and stats from the QC programs implemented in the mapping script above can be visualized as plots. For this, we will use R.
The following packages are required to be installed to perform the analyses described thereafter:
| Package | Version |
|---|---|
| ggplot2 | 3.3.5 |
| gridExtra | 2.3 |
First, ensure all your Star ‘summary mapping statistics’ files (*Log.final.out) of interest are in a single subdirectory of your current working directory in R. Then, store the full path of each QC file in a character vector, load them all into R, and modify each into a more computational-friendly format.
# list files to be read-in
infiles <- list.files(path="starQC_CRISPRdisplay_RPMI/", # folder with log files
pattern="Log.final.out",
full.names=TRUE)
# iterate over the file list to generate a list object of data frames
align.results <- lapply(infiles, function(x)
read.table(x, sep="|", strip.white=TRUE, stringsAsFactors=FALSE, skip=3, fill=TRUE, header=FALSE))
# remove "%" character from some values to force to numeric
align.results <- lapply(align.results, function(x)
transform(x, V2=as.numeric(gsub("%", "", x$V2))))
# apply file names to each data frame (become identifying row names in next step)
names(align.results) <- gsub(".*PM_(.*)_S[0-9].*", "\\1", infiles)
# concatenate all data frames from the list object into one large data frame
align.results.df <- as.data.frame(do.call(rbind, align.results))
# remove lines without values
align.results.df <- align.results.df[complete.cases(align.results.df),]
The next step is highly dependent on the unique naming structure used for your data, thus, the code below requires considerable modification using regular expressions (regex). The goal is to extract the sample ID from each row’s corresponding row name and add it to a new column sample, and to do the same for the replicate number and add it to a new column replicate.
# add sample and replicate ID columns using info from the row names
align.results.df$sample <- gsub("^(.*)CD_(WT|KO)[0-9]*(.*)\\.\\d+$", "\\1\\2\\3", row.names(align.results.df))
align.results.df$replicate <- as.factor(as.numeric(gsub(".*[WT|KO]([0-9]*).*", "\\1", row.names(align.results.df))))
Parsing a string with Regex (click to expand)
In the example above, our row names are of the form, e.g., “RPMI_CD_KO1_C17.1”.
To extract the sample ID…
The first argument of the
gsub()function is:"^(.*)CD_(WT|KO)[0-9]*(.*)\\.\\d+$"where
^= string beginning anchor,()= a substring to be captured,.*= greedy match to any character(s),|= an “OR” logic operator,[0-9]= a character class matching one number between zero and nine (same as\d),*= match the preceding character 0 or more times,\\= a regex escape sequence for a literal character,\= an escape sequence for a literal character,\d= match a single digit character (same as[0-9]),$= string end anchor.The second argument of the
gsub()function is:"\\1\\2\\3"where
\\1= retrieve the first capture group (AKA “back reference”) from the input string sequence.The third argument of the
gsub()function is:row.names(align.results.df)where the input is the string to be parsed bygsub.
To extract the replicate number…
Apply the same principles of the regex pattern matching described above.
Now that the data has sample and replicate information, we can choose which QC metrics from the Star summary stats file we wish to visualize as bar plots.
# define STAR stats of interest (use unique(align.results.df$V1) to see possibilities)
filters <- c("Number of input reads",
"Average input read length",
"Uniquely mapped reads number",
"Uniquely mapped reads %",
"Average mapped length")
Custom functions are needed to plot multiple style-matched plots in the same graphical space with one common legend.
# create functions to plot style-matched figures
PlottingAlignmentResults <- function(Filter, DF, Legend=TRUE, PlotMedian = TRUE){
# this function extracts lines corresponding to the value provided to Filter and generates a bar
# plot for each where each replicate is represented by a different color
library(ggplot2)
library(grid) # for unit() function
filtered.df <- DF[which(DF$V1 == Filter),]
medians <- as.data.frame(aggregate(V2~sample, data=filtered.df, FUN=median))
filtered.df <- merge(filtered.df, medians, by.x = "sample", by.y = "sample", all.x=TRUE)
p <- ggplot(data=filtered.df, aes(fill=replicate, y=V2.x, x=sample)) +
geom_bar(stat="identity", position=position_dodge()) +
theme_bw(base_size = 10) +
theme(plot.title = element_text(face="bold", size=12, hjust=0.5),
legend.position="bottom",
legend.text = element_text(size = 8),
legend.key.size = unit(0.3, "cm"),
legend.title=element_blank()) +
coord_flip() + ylab("") + ggtitle(Filter)
if(PlotMedian){
p <- p + geom_errorbar(aes(y=V2.y, ymax=V2.y, ymin=V2.y), linetype="dashed")
}
if(!Legend){
p <- p + theme(legend.position="none")
}
return(p)
}
ExtractLegend <- function(Plot){
library(ggplot2)
G <- ggplotGrob(Plot)$grobs
Legend <- G[[which(sapply(G, function(x) x$name) == "guide-box")]]
Lheight <- sum(Legend$height)
return(list(legend = Legend, lheight=Lheight))
}
Finally, print the combined QC metrics plot.
# generate a bar plot for each entry in 'filters' vector and build a single legend for all
plots <-lapply(filters, function(x)
PlottingAlignmentResults(x, align.results.df, Legend=FALSE))
my.legend <- ExtractLegend(PlottingAlignmentResults(align.results.df, Filter=filters[1], Legend=TRUE))
# combine plots and legend
gridExtra::grid.arrange(gridExtra::arrangeGrob(plots[[1]], plots[[2]], plots[[3]], plots[[4]], plots[[5]], nrow=2),
my.legend$legend, nrow=2, heights=unit.c(unit(1, "npc") - my.legend$lheight, my.legend$lheight))
If you wish to view which samples have a high mapping rate (>75%), simply subset the data frame using the Uniquely mapped reads % filter.
# print a subset of the table for all samples with high mapping rate (>75%)
subset(align.results.df, V1 == filters[4] & V2 > 75)
## V1 V2 sample replicate
## RPMI_CD_KO1_C17.6 Uniquely mapped reads % 88.61 RPMI_KO_C17 1
## RPMI_CD_KO1_C9.6 Uniquely mapped reads % 88.81 RPMI_KO_C9 1
## RPMI_CD_KO1_CC.6 Uniquely mapped reads % 87.21 RPMI_KO_CC 1
## RPMI_CD_KO1_CUnd.6 Uniquely mapped reads % 88.41 RPMI_KO_CUnd 1
## RPMI_CD_KO1_dCas9.6 Uniquely mapped reads % 88.11 RPMI_KO_dCas9 1
## RPMI_CD_KO1_F17.6 Uniquely mapped reads % 88.85 RPMI_KO_F17 1
And if you want to see if there is a correlation between any two QC metric variables, create the function below and provide the filters of interest.
# generate a dot plot to visualize the correlation between two QC variables
PlottingCorrelation <- function(DF, Var1, Var2, Var1.Label, Var2.Label){
# usage: PlottingCorrelation(DF=aligned.reads.df,
# Var1="Number of input reads", Var2="Uniquely mapped reads %",
# Var1.Label = "InputReads", Var2.Label="UniquelyMappedFraction")
m <- matrix(data = c(DF$V2[which(DF$V1 == Var1)],
DF$V2[which(DF$V1 == Var2)]),
ncol = 2)
colnames(m) <- c(Var1.Label,Var2.Label)
plot(m)
}
sampletype <- "WT"
PlottingCorrelation(DF = subset(align.results.df, grepl(sampletype, row.names(align.results.df))),
Var1 = "Uniquely mapped reads number", Var2 = filters[4],
Var1.Label = paste("No. of uniquely mapped reads (",sampletype,")", sep=""), Var2.Label = paste(filters[4]," (",sampletype,")", sep=""))
Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 16 September 2021.
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635